Abstract

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We 
extend the classical analysis of Young in the presence of a fault prediction system, which is characterized
by its recall and its precision, and which provides either exact or window-based time predictions.
We succeed in deriving the optimal value of the checkpointing period (thereby minimizing the waste 
of resource usage due to checkpoint overhead) in all scenarios. These results lay the foundations for 
future experimental validation of the model. 

1 Introduction 

In this paper, we assess the impact of fault prediction techniques on checkpointing strategies. We consider
jobs executing on a platform subject to faults, and we let $\mu$ be the mean time between faults
(MTBF) of the platform. In the absence of fault prediction, the standard approach is to take periodic
checkpoints, each of length $C$, every period of duration $T$. In steady-state utilization of the platform,
the value $T_{opt}$ of $T$ that minimizes the (expectation of the) waste of resource usage due to checkpointing
is easily computed as $T_{opt} = \sqrt{2\mu C}$. This is the well-known Young formula.
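
As a quick numerical illustration of Young's formula, the following Python sketch computes $T_{opt}$ and checks that nearby periods do worse. The MTBF and checkpoint durations are hypothetical, and the waste expression used here, $C/T + T/(2\mu)$, is a first-order simplification that ignores downtime and recovery (an assumption made only for this example).

```python
import math


def young_period(mu: float, C: float) -> float:
    """Young's period T_opt = sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mu * C)


def first_order_waste(T: float, mu: float, C: float) -> float:
    """Checkpoint overhead C/T plus expected re-execution T/(2*mu).

    Assumption for this sketch: downtime and recovery are ignored.
    """
    return C / T + T / (2.0 * mu)


# Hypothetical platform: MTBF of one day, 10-minute checkpoints (in seconds).
mu, C = 86_400.0, 600.0
T_opt = young_period(mu, C)
print(f"T_opt = {T_opt:.1f} s, waste = {first_order_waste(T_opt, mu, C):.4f}")
```

Perturbing `T_opt` by 10% in either direction increases the value returned by `first_order_waste`, as expected from a convex function evaluated around its minimum.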

Now, when some fault prediction mechanism is available, can we compute a better checkpointing
period to decrease the expected waste, and to what extent? Critical parameters that characterize a
fault prediction system are its recall r, which is the fraction of faults that are indeed predicted, and its 
precision p, which is the fraction of predictions that are correct (i.e., correspond to actual faults). The 
major objective of this paper is to refine the expression of the expected waste as a function of these 
new parameters, and to derive optimal values for the checkpointing period. We deal with two problem 
instances, one where the predictor system provides exact dates for predicted events, and another where 
it only provides time windows during which events take place. We succeed in characterizing optimal 
values for both instances. 

The results of this preliminary work lay the theoretical foundations for the study of the impact of 
prediction on checkpointing strategies, and are a prerequisite for conducting experimental simulations to 
fully validate the analysis for realistic application/platform scenarios. 



2 Framework 



2.1 Checkpointing strategy 

We consider a platform subject to faults. Our work is agnostic of the granularity of the platform, 
which may consist either of a single processor, or of several processors that work concurrently and use 
coordinated checkpointing. The key parameter is $\mu$, the mean time between faults (MTBF) of the
platform. If the platform is made of $K$ components whose individual MTBF is $\mu_{ind}$, then $\mu = \mu_{ind}/K$.

Checkpoints are taken at regular intervals, or periods, of length $T$. We use $C$, $D$, and $R$ for the
duration of the checkpoint, downtime and recovery (respectively). We must enforce that $C \le T$, and
useful work is done only during $T - C$ units of time in every period of length $T$, if no fault occurs. The
waste due to checkpointing in a fault-free execution is $\mathrm{Waste} = C/T$. In the following, the waste always
denotes the fraction of time that the platform is not doing useful work.

2.2 Fault predictor 

A fault predictor is a mechanism that is able to predict that some faults will take place, either at a certain 
point in time, or within some time-interval window. The accuracy of the fault predictor is characterized 
by two quantities, the recall and the precision: 

• The recall r is the fraction of faults that are predicted; 

• The precision p is the fraction of fault predictions that are correct. 

Traditionally, one defines three types of events: (i) true positive events are faults that the predictor
has been able to predict (let $True_P$ be their number); (ii) false positive events are fault predictions that
did not materialize as actual faults (let $False_P$ be their number); and (iii) false negative events are faults
that were not predicted (let $False_N$ be their number). With these definitions, we have

$$r = \frac{True_P}{True_P + False_N} \quad \text{and} \quad p = \frac{True_P}{True_P + False_P}$$
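
To make these two definitions concrete, here is a minimal Python sketch that derives the recall and the precision from event counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def recall_precision(true_p: int, false_p: int, false_n: int) -> tuple[float, float]:
    """Return (recall, precision) from true-positive, false-positive and
    false-negative event counts."""
    if true_p + false_n == 0 or true_p + false_p == 0:
        raise ValueError("need at least one fault and at least one prediction")
    recall = true_p / (true_p + false_n)      # fraction of faults predicted
    precision = true_p / (true_p + false_p)   # fraction of predictions correct
    return recall, precision


# Hypothetical log: 80 predicted faults, 20 false alarms, 120 missed faults.
r, p = recall_precision(true_p=80, false_p=20, false_n=120)
print(f"recall = {r:.2f}, precision = {p:.2f}")
```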

We point out that the precision $p$ is a standard notion in the literature [1, 2]. However, the
name "precision" can be misleading. For instance, consider a predictor that provides time windows of 
length $I$ for a platform whose MTBF is $\mu$. The probability that a fault takes place inside the window
is $I/\mu$, hence a precision $p = I/\mu$ brings no additional information. Of course a very high precision makes
it possible to identify those time intervals where faults are more likely to strike, but a very low precision
is useful too (somewhat counter-intuitively!): it makes it possible to identify those time intervals where
faults should not be expected.

2.3 Fault rates 

In addition to $\mu$, the mean time between faults (MTBF) of the platform, let $\mu_P$ be the mean time
between predicted events (both true positive and false positive), and let $\mu_{NP}$ be the mean time between
unpredicted faults (false negative). Finally, we define the mean time between events as $\mu_e$ (including all
three event types). The relationships between $\mu$, $\mu_P$, $\mu_{NP}$, and $\mu_e$ are the following:

• $\frac{1}{\mu_{NP}} = \frac{1-r}{\mu}$ (here, $1 - r$ is the fraction of faults that are unpredicted);

• $\frac{r}{\mu} = \frac{p}{\mu_P}$ (here, $r$ is the fraction of faults that are predicted, and $p$ is the fraction of fault predictions
that are correct);

• $\frac{1}{\mu_e} = \frac{1}{\mu_P} + \frac{1}{\mu_{NP}}$ (here, events are either predicted (true or false) or not).
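
The three relationships above can be sketched in Python as follows. The function name and the sample values are hypothetical; infinite values stand for the degenerate cases where no fault is predicted ($r = 0$) or every fault is predicted ($r = 1$).

```python
import math


def event_rates(mu: float, r: float, p: float) -> tuple[float, float, float]:
    """Return (mu_P, mu_NP, mu_e) from the platform MTBF mu, the recall r
    and the precision p, assuming 0 <= r <= 1 and 0 < p <= 1."""
    mu_NP = mu / (1.0 - r) if r < 1.0 else math.inf  # 1/mu_NP = (1 - r)/mu
    mu_P = p * mu / r if r > 0.0 else math.inf       # r/mu = p/mu_P
    mu_e = 1.0 / (1.0 / mu_P + 1.0 / mu_NP)          # 1/mu_e = 1/mu_P + 1/mu_NP
    return mu_P, mu_NP, mu_e


# Hypothetical values: MTBF of 1000 s, recall 0.8, precision 0.5.
mu_P, mu_NP, mu_e = event_rates(mu=1000.0, r=0.8, p=0.5)
print(mu_P, mu_NP, mu_e)
```

With these values, predictions arrive more often than faults ($\mu_P < \mu$) because half of them are false alarms.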

3 Predictor with exact event dates 

In this section, we present an analytical model to assess the impact of prediction on periodic checkpointing 
strategies. We consider the case where the predictor is able to provide exact prediction dates, and to 
generate such predictions at least C seconds in advance, so that a checkpoint can indeed be taken before 
the event (otherwise the prediction cannot be useful, because there is not enough time to take proactive 
actions). We consider the following algorithm: 

1. While no fault prediction is available, checkpoints are taken periodically with period T. 

2. When a fault is predicted, we decide whether to take the prediction into account or not. This
decision is randomly taken: with probability $q$ we trust the predictor and take the prediction into
account, and, with probability $1 - q$, we ignore the prediction. If we take the prediction into account,
there are two cases. If we have enough time before the prediction date, we take a checkpoint as
late as possible, i.e., so that it completes right at the time where the fault is predicted to happen.
After the checkpoint, we then complete the execution of the period (see Figure 1). Otherwise, if we
do not have enough time to take an extra checkpoint ($\epsilon < C$), then we do some extra work during
$\epsilon$ seconds (see Figure 2). We account for this work as idle time in the expression of the waste, to
ease the analysis. Our expression of the waste is thus an upper bound.

Figure 1: Whenever there is enough time, the algorithm takes a checkpoint just before the predicted
failure.

Figure 2: Whenever there is not enough time to take a checkpoint, the algorithm executes some extra
work.

The rationale for not always trusting the predictor is to avoid taking useless checkpoints too frequently. 
Intuitively, the precision p of the predictor must be above a given threshold for its usage to be worthwhile. 
In other words, if we decide to checkpoint just before a predicted event, there are two cases: either we 
will save time by avoiding a costly re-execution if the event does correspond to an actual fault, or we will 
lose time by unduly performing an extra checkpoint if the event does not correspond to an actual fault. 
We need a larger proportion of the former cases, i.e., a good precision, for the predictor to be really 
useful. The following analysis will determine the optimal value of $q$ as a function of the parameters $C$,
$\mu$, $r$, and $p$.



3.1 Computing the waste 

Our goal in this section is to compute a formula for the expected waste. Recall that the waste is 
the fraction of time that the processors do not perform useful computations, either because they are 
checkpointing, or because they recover from a fault. There are four different sources of waste (see
Figure 3):

1. Checkpoints: During a fault-free execution, the fraction of resources used in checkpointing is:

$$\frac{C}{T} \tag{1a}$$

2. Unpredicted faults: This overhead occurs each time an unpredicted fault strikes, that is, on
average, once every $\mu_{NP}$ seconds. The time wasted because of the unpredicted fault is then the
time elapsed between the last checkpoint and the fault, plus the downtime and the time needed for
the recovery. The expectation of the time elapsed between the last checkpoint and the fault is equal
to half the period of checkpoints, because the time where the fault hits the system is independent
of the checkpointing algorithm. Finally, the waste due to unpredicted faults is:

$$\frac{1}{\mu_{NP}} \left[ \frac{T}{2} + D + R \right] \tag{1b}$$



3. Predictions taken into account: Now we have to compute the execution overhead due to a
prediction which we trust (hence we checkpoint just before its date). This overhead occurs each
time a prediction is made by the predictor, that is, on average, once every $\mu_P$ seconds, and we
decide to trust it, with probability $q$. If the predicted event is an actual fault, we waste $C + D + R$
seconds: we waste $D + R$ seconds because the predicted event corresponds to an actual fault and,
if we have enough time before the prediction date, we waste $C$ seconds because we take an extra
checkpoint as late as possible before the prediction date (see Figure 1). Note that if we do not have
enough time to take an extra checkpoint (see Figure 2), we overestimate the waste as $C$ seconds.
Otherwise, if the predicted event is not an actual fault, we waste $C$ seconds. An actual fault occurs
with probability $p$, and a false prediction is made with probability $(1 - p)$. Averaging with these
probabilities, we waste an expected amount of $[p(C + D + R) + (1 - p)C]$ seconds. Finally, the
corresponding overhead is:

$$\frac{1}{\mu_P}\, q \left[ p(C + D + R) + (1 - p)C \right] \tag{1c}$$

Figure 3: Actions taken when the predictor provides exact event dates.

4. Ignored predictions: The final source of waste is for predicted events that we do not trust. This
overhead occurs each time a prediction is made by the predictor, that is, on average, once every $\mu_P$
seconds, and we decide to ignore it, with probability $1 - q$. If the predicted event corresponds
to an actual fault, we waste $(\frac{T}{2} + D + R)$ seconds (as for an unpredicted fault). Otherwise there is
no fault and we took no extra checkpoint, and thus we lose nothing. An actual fault occurs with
probability $p$. The corresponding overhead is:

$$\frac{1}{\mu_P}\, (1 - q) \left[ p\left(\frac{T}{2} + D + R\right) + (1 - p) \cdot 0 \right] \tag{1d}$$



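Summing up the overhead over the four different sources, we obtain the following equation for the
waste:

$$\mathrm{Waste} = \frac{C}{T} + \frac{1}{\mu_{NP}} \left[ \frac{T}{2} + D + R \right] + \frac{1}{\mu_P}\, q \left[ p(C + D + R) + (1 - p)C \right] + \frac{1}{\mu_P}\, (1 - q) \left[ p\left(\frac{T}{2} + D + R\right) + (1 - p) \cdot 0 \right]$$

After simplification, we have:

$$\mathrm{Waste} = \frac{C}{T} + \frac{1}{\mu} \left[ (1 - rq)\frac{T}{2} + D + R + \frac{qr}{p} C \right] \tag{2}$$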
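
The simplification can be checked numerically. The Python sketch below sums the four sources of waste and compares the result with the simplified form of Equation (2); the function names and parameter values are hypothetical, and the code assumes $0 < r < 1$ and $0 < p \le 1$ so that $\mu_P$ and $\mu_{NP}$ are finite.

```python
def waste_by_source(T, C, D, R, mu, r, p, q):
    """Sum of the four sources of waste (checkpoints, unpredicted faults,
    trusted predictions, ignored predictions)."""
    mu_NP = mu / (1 - r)   # mean time between unpredicted faults
    mu_P = p * mu / r      # mean time between predictions
    checkpoints = C / T
    unpredicted = (T / 2 + D + R) / mu_NP
    trusted = q * (p * (C + D + R) + (1 - p) * C) / mu_P
    ignored = (1 - q) * (p * (T / 2 + D + R) + (1 - p) * 0) / mu_P
    return checkpoints + unpredicted + trusted + ignored


def waste_simplified(T, C, D, R, mu, r, p, q):
    """Simplified form, Equation (2)."""
    return C / T + ((1 - r * q) * T / 2 + D + R + q * r * C / p) / mu


# Hypothetical parameters, in seconds.
params = dict(T=3600.0, C=600.0, D=60.0, R=600.0, mu=86_400.0, r=0.85, p=0.82)
for q in (0.0, 0.3, 1.0):
    print(q, waste_by_source(q=q, **params), waste_simplified(q=q, **params))
```

Both functions print the same value for each $q$, up to floating-point rounding.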

3.2 Validity of the analysis 

We point out that Equation (2) is accurate only when two events (an event being a prediction (true or
false) or an unpredicted fault) do not take place within the same period. To ensure that this condition
is met with a high probability, we bound the length of the period: we enforce the condition $T \le \alpha\mu_e$,
where $\alpha$ is some tuning parameter.

In fact, the number of events during a period of length $T$ can be modeled as a Poisson process of
parameter $T/\mu_e$; the probability of having $k \ge 0$ faults is $P(X = k) = \frac{(T/\mu_e)^k}{k!} e^{-T/\mu_e}$, where $X$ is the
number of faults. Hence the probability of having two or more faults is
$\pi = P(X \ge 2) = 1 - (P(X = 0) + P(X = 1)) = 1 - \left(1 + \frac{T}{\mu_e}\right) e^{-T/\mu_e}$.
Enforcing the constraint $T \le \alpha\mu_e$ leads to $\pi \le 1 - (1 + \alpha)e^{-\alpha}$. If we assume $\alpha = 0.1$
then $\pi \le 0.005$, hence a valid approximation when bounding the period range accordingly. Indeed, with
such a conservative value for $\alpha$, we have overlapping faults every 200 periods on average, so that the
model is accurate for at least 98% of the checkpointing segments, hence quite reliable.
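
The bound on $\pi$ is easy to evaluate. The following Python sketch (the function name is hypothetical) computes $1 - (1 + \alpha)e^{-\alpha}$ for a few values of the tuning parameter $\alpha$.

```python
import math


def overlap_probability_bound(alpha: float) -> float:
    """Upper bound on P(X >= 2) under the Poisson model when T <= alpha * mu_e."""
    return 1.0 - (1.0 + alpha) * math.exp(-alpha)


for alpha in (0.05, 0.1, 0.2):
    print(f"alpha = {alpha}: pi <= {overlap_probability_bound(alpha):.5f}")
```

For $\alpha = 0.1$ the bound is slightly below 0.005, in line with the value quoted above, and it grows with $\alpha$.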

In addition to the previous constraint, recall that we must always enforce the condition $C \le T$, by
construction of the periodic checkpointing policy. Finally, note that the optimal waste may never exceed
1, since it represents the fraction of time that is "wasted". When the waste equals 1, the application no 
longer makes progress. 

3.3 Waste minimization 

3.3.1 Computation of the extremum period $T_{extr}$

We have the expression of the waste from Equation (2):

$$\mathrm{Waste}(T) = \frac{C}{T} + \frac{1}{\mu} \left[ (1 - rq)\frac{T}{2} + D + R + \frac{qr}{p} C \right]$$

Differentiating twice with respect to $T$, we obtain

$$\mathrm{Waste}'(T) = -\frac{C}{T^2} + \frac{1 - rq}{2\mu}$$

$$\mathrm{Waste}''(T) = \frac{2C}{T^3}$$

We obtain that $\mathrm{Waste}''(T)$ is strictly positive, hence $\mathrm{Waste}(T)$ is a convex function of $T$ and admits
a unique minimum on its domain $[C, \alpha\mu_e]$. We also compute $T_{extr}$, the extremum value of $T$ that is the
unique zero of the function $\mathrm{Waste}'(T)$:

$$T_{extr} = \sqrt{\frac{2\mu C}{1 - rq}}$$

Note that this equation makes sense even when $1 - rq = 0$: indeed this would mean that both $r = 1$
and $q = 1$: the predictor predicts every fault, and we take proactive action for each one of them. Then
there should never be any periodic checkpointing!

Finally, note that $T_{extr}$ may well not belong to the admissible domain $[C, \alpha\mu_e]$. The optimal waste
$\mathrm{Waste}_{opt}$ is determined via the following case analysis.



3.3.2 Computation of $\mathrm{Waste}_{opt}$

We rewrite the waste as an affine function of $q$:

$$\mathrm{Waste}(q) = \frac{rq}{\mu} \left( \frac{C}{p} - \frac{T}{2} \right) + \frac{C}{T} + \frac{T}{2\mu} + \frac{D + R}{\mu}$$

For any value of $T$, we deduce that $\mathrm{Waste}(q)$ is minimized either for $q = 0$ or for $q = 1$. This (somewhat
unexpected) conclusion is that the predictor should sometimes be always trusted, and sometimes
never, but no in-between value for $q$ will do a better job. Thus we need to minimize the two functions
$\mathrm{Waste}_{\{q=0\}}$ and $\mathrm{Waste}_{\{q=1\}}$ over the domain of admissible values for $T$, and to retain the best result.
We have:



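$$\mathrm{Waste}_{\{q=0\}}(T) = \frac{C}{T} + \frac{1}{\mu} \left[ \frac{T}{2} + D + R \right]$$

The function $\mathrm{Waste}_{\{q=0\}}(T)$ is a convex function and reaches its minimum for $T_{opt1}$ in the interval
$[C, \alpha\mu_e]$:

• If $C \le T_{extr} \le \alpha\mu_e$: $T_{opt1} = T_{extr} = \sqrt{2\mu C}$

• If $T_{extr} < C$: $T_{opt1} = C$

• If $T_{extr} > \alpha\mu_e$: $T_{opt1} = \alpha\mu_e$

Thus, $\mathrm{Waste}_{\{q=0\}}$ is minimized for:

$$T_{opt1} = \min \left\{ \alpha\mu_e, \max\left\{ \sqrt{2\mu C}, C \right\} \right\}$$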

Similarly, we have: 



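$$\mathrm{Waste}_{\{q=1\}}(T) = \frac{C}{T} + \frac{1}{\mu} \left[ (1 - r)\frac{T}{2} + D + R + \frac{r}{p} C \right]$$

The function $\mathrm{Waste}_{\{q=1\}}(T)$ is a convex function and reaches its minimum for $T_{opt2}$ in the interval
$[C, \alpha\mu_e]$:

• If $C \le T_{extr} \le \alpha\mu_e$: $T_{opt2} = T_{extr} = \sqrt{\frac{2\mu C}{1 - r}}$

• If $T_{extr} < C$: $T_{opt2} = C$

• If $T_{extr} > \alpha\mu_e$: $T_{opt2} = \alpha\mu_e$

Thus, $\mathrm{Waste}_{\{q=1\}}$ is minimized for:

$$T_{opt2} = \min \left\{ \alpha\mu_e, \max\left\{ \sqrt{\frac{2\mu C}{1 - r}}, C \right\} \right\}$$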



Finally, the optimal waste is: 

$$\mathrm{Waste}_{opt} = \min \left\{ \mathrm{Waste}_{\{q=0\}}(T_{opt1}), \mathrm{Waste}_{\{q=1\}}(T_{opt2}) \right\}$$
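
The case analysis of this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the sample parameters are hypothetical, and the code assumes $0 < p \le 1$ and $0 \le r \le 1$.

```python
import math


def waste(T, C, D, R, mu, r, p, q):
    """Equation (2): expected waste for period T and trust probability q."""
    return C / T + ((1 - r * q) * T / 2 + D + R + q * r * C / p) / mu


def optimal_policy(C, D, R, mu, r, p, alpha=0.1):
    """Return (q, T, waste) minimizing Equation (2) over q in {0, 1} and
    T in [C, alpha * mu_e]."""
    mu_e = mu / (r / p + (1 - r))   # from 1/mu_e = 1/mu_P + 1/mu_NP
    T_max = alpha * mu_e
    if T_max < C:
        raise ValueError("empty domain: alpha * mu_e < C")

    def clamp(T_extr):
        # Project the unconstrained extremum onto [C, alpha * mu_e].
        return min(T_max, max(T_extr, C))

    T_opt1 = clamp(math.sqrt(2 * mu * C))  # q = 0 (never trust)
    # q = 1 (always trust); with r = 1 the extremum is infinite, so the
    # upper bound of the domain is used.
    T_opt2 = clamp(math.sqrt(2 * mu * C / (1 - r))) if r < 1 else T_max
    w0 = waste(T_opt1, C, D, R, mu, r, p, 0)
    w1 = waste(T_opt2, C, D, R, mu, r, p, 1)
    return (0, T_opt1, w0) if w0 <= w1 else (1, T_opt2, w1)


# Hypothetical parameters (seconds) with a fairly accurate predictor.
q, T, w = optimal_policy(C=10, D=1, R=10, mu=100_000, r=0.85, p=0.82)
print(q, round(T, 1), round(w, 5))
```

With these sample values the predictor is trusted ($q = 1$) and the period is longer than Young's period. With a very imprecise predictor (for instance $r = 0.1$ and $p = 0.001$ in this sketch), the extra checkpoints cost more than they save and $q = 0$ is returned.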

3.4 Prediction and preventive migration 

In this section, we make a short digression and briefly present an analytical model to assess the impact 
of prediction and preventive migration on periodic checkpointing strategies. As before, we consider a 
predictor that is able to predict exactly when faults happen, and to generate these predictions at least 
C seconds before the event dates. 

The idea of migration consists in moving a task for execution on another node, when a fault is 
predicted to happen on the current node in the near future. Note that the faulty node can later be 
replaced, in case of a hardware fault, or software rejuvenation can be used in case of a software fault. 
We consider the following algorithm, which is very similar to that used in Section 3:

1. When no fault prediction is available, checkpoints are taken periodically with period T. 

2. When a fault is predicted, we decide whether to execute the migration or not. The decision is a 
random one: with probability q we trust the predictor and do the migration and, with probability 
1-q, we ignore the prediction. If we take the prediction into account, we execute the migration as 
late as possible, so that it completes right at the time when the fault is predicted to happen. 

As before, we have four different sources of waste. Summing the overhead of the execution of these 
different sources, we obtain the following equation for the waste (where M is the duration of a migration): 



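$$\mathrm{Waste} = \frac{C}{T} + \frac{1}{\mu_{NP}} \left[ \frac{T}{2} + D + R \right] + \frac{1}{\mu_P}\, q \left[ p(M) + (1 - p)M \right] + \frac{1}{\mu_P}\, (1 - q) \left[ p\left(\frac{T}{2} + D + R\right) + (1 - p) \cdot 0 \right]$$

After simplification, we get:

$$\mathrm{Waste} = \frac{C}{T} + \frac{1}{\mu} \left[ (1 - rq)\left(\frac{T}{2} + D + R\right) + \frac{qr}{p} M \right] \tag{4}$$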



Equation (4) is very similar to Equation (2), and the minimization of the waste proceeds exactly as
in Section 3.3. In a nutshell, $\mathrm{Waste}(T)$ is again a convex function and admits a unique minimum over
its domain $[C, \alpha\mu_e]$, the unique zero of the derivative has the same value $T_{extr} = \sqrt{\frac{2\mu C}{1 - rq}}$, and for any
value of $T$, the waste is minimized for either $q = 0$ or $q = 1$. We conduct the very same case analysis as
in Section 3.3.2.
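
As a sanity check on the migration model, the Python sketch below evaluates the simplified waste of Equation (4) and verifies numerically that $T_{extr}$ is a local minimum. The function names and the sample parameters (including the migration time $M$) are hypothetical, and the code assumes $rq < 1$.

```python
import math


def waste_migration(T, C, D, R, M, mu, r, p, q):
    """Equation (4): expected waste with preventive migration of duration M."""
    return C / T + ((1 - r * q) * (T / 2 + D + R) + q * r * M / p) / mu


def extremum_period(C, mu, r, q):
    """Unique zero of the derivative of the waste, valid when r * q < 1."""
    return math.sqrt(2 * mu * C / (1 - r * q))


# Hypothetical parameters, in seconds.
args = dict(C=600.0, D=60.0, R=600.0, M=300.0, mu=86_400.0, r=0.85, p=0.82, q=1.0)
T_extr = extremum_period(args["C"], args["mu"], args["r"], args["q"])
for factor in (0.95, 1.0, 1.05):
    print(factor, waste_migration(factor * T_extr, **args))
```

The middle value printed (factor 1.0) is the smallest of the three, as expected for a convex function at its extremum.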



4 Predictor with a prediction window 

In the previous section, we have supposed that the predictor was able to predict exactly when faults 
will strike. Here, we suppose (maybe more realistically) that the predictor gives a prediction window, 
that is, an interval of time of length $I$ during which the predicted fault is likely to happen. As before
in Section 3, we suppose that we have enough time to checkpoint before the beginning of the prediction
window. Also, as in Section 3, when a prediction is made, the scheduling algorithm takes this prediction
into account with probability $q$, and ignores it otherwise.

We start with a description of the strategies that can be used, depending upon the (relative) length
$I$ of the prediction window. Let us define two modes for the scheduling algorithm:

Regular: This is the mode used when no fault prediction is available, or when a prediction is available
but we decide to ignore it (with probability $1 - q$). In regular mode, we use periodic checkpointing
with period $T_{NP}$. Intuitively, $T_{NP}$ corresponds to the checkpointing period $T$ of Section 3.

Proactive: This is the mode used when a fault prediction is available and we decide to trust it (a decision
taken with probability $q$). Consider such a trusted prediction made with the prediction window
$[t_0, t_0 + I]$. There are several strategies that can be envisioned:

1. Instantaneous. The first strategy is to ignore the time-window and to execute the same
algorithm as if the predictor had given an exact fault date at time $t_0$. Just as described in
Section 3, the algorithm interrupts the current period (of scheduled length $T_{NP}$), checkpoints
during the interval $[t_0 - C, t_0]$, and then returns to regular mode: at time $t_0$, it resumes the
work due to complete the interrupted period.

2. No checkpoint during prediction window. The second strategy is intended for a short prediction
window: instead of ignoring it, we acknowledge it, but make the decision not to checkpoint
during it. As in the first strategy, the algorithm interrupts the current period (of scheduled
length $T_{NP}$), and checkpoints during the interval $[t_0 - C, t_0]$. But here, we return to regular
mode only at time $t_0 + I$, where we resume the work due to complete the interrupted period
of the regular mode. During the whole length of the time-window, we execute work without
checkpointing, at the risk of losing work if a fault indeed strikes. But for a small value of $I$,
it may not be worthwhile to checkpoint during the prediction window (if at all possible, note
that there is no choice if $I < C$).

3. With checkpoints during prediction window. The third strategy is intended for a longer prediction
window: as before, the algorithm interrupts the current period (of scheduled length $T_{NP}$),
and checkpoints during the interval $[t_0 - C, t_0]$, but then decides to take several checkpoints
during the prediction window. The period $T_P$ of these checkpoints in proactive mode will presumably
be shorter than $T_{NP}$, to take into account the higher fault probability. To simplify
the presentation, we will use an integer number of periods of length $T_P$ within the prediction
window. In the following, we analytically compute the optimal number of such periods. But
we take at least one period here, hence one checkpoint, which implies that $C \le I$. We return
to regular mode either right after the fault has struck within the time window $[t_0, t_0 + I]$, or at
time $t_0 + I$ whenever no actual fault happens within the prediction window. Then, we resume
the work due to complete the interrupted period of the regular mode.
Figure 4: Outline of the behavior of Algorithm 1 (third strategy) (checkpoints taken during the prediction
window in proactive mode).



The third strategy is the most complex to describe, and the complete behavior of the scheduling
algorithm is shown in Algorithm 1. Note that for all strategies, exactly as in Section 3, we insert some
additional work for the particular case where there is not enough time to take a checkpoint before
entering proactive mode (because a checkpoint for the regular mode is currently on-going, see Figure 2).
We account for this work as idle time in the expression of the waste, to ease the analysis. Our expression
of the waste is thus an upper bound.



Algorithm 1: Proactive algorithm.

 1 if fault happens then
 2     After downtime, execute recovery;
 3     Enter regular mode;
 4 if in proactive mode for a time greater than or equal to I then
 5     Switch to regular mode
 6 if Prediction made with interval [t, t + I] and prediction taken into account then
 7     Let t_C be the date of the last checkpoint under regular mode to start no later than t - C;
 8     if t_C + C < t - C then (time for an extra checkpoint)
 9         Take a checkpoint starting at time t - C
10     else (no time for the extra checkpoint)
11         Work in the time interval [t_C + C, t]
12     W_reg ← max(0, t - C - (t_C + C));
13     Switch to proactive mode at time t;
14 while in regular mode and no predictions are made and no faults happen do
15     Work for a time T_NP - W_reg - C and then checkpoint;
16     W_reg ← 0;
17 while in proactive mode and no faults happen do
18     Work for a time T_P - C and then checkpoint;



First we compute the waste incurred by the three algorithms, starting with the most complex strategy
(Section 4.1), and then simplifying the formula and establishing the result for the other two strategies
(Section 4.2). Then we discuss the validity of the model in Section 4.3. Finally, we solve the optimization
problem and derive optimal values for the parameter $q$, and for the two periods $T_P$ and $T_{NP}$ (Section 4.4).

4.1 Computing the waste with checkpoints during prediction window 

In this section we focus on computing the waste of the most complex strategy, that with checkpoints
during prediction window (Algorithm 1). As in Section 3, we assume that there is a single event of any
type (either a prediction (true or false), or an unpredicted failure). As already mentioned, we discuss
this hypothesis in Section 4.3.

We first compute which fraction of the time the algorithm spends in either mode: 
• the fraction of time spent in the regular mode (checkpointing with period $T_{NP}$);

• the fraction of time spent in the proactive mode (checkpointing with period $T_P$).

Let $I'$ be the average time spent in the proactive mode. When a prediction is made, we may choose
to ignore it, which happens with probability $1 - q$. In this case, the algorithm stays in regular mode
and does not spend any time in the proactive mode. With probability $q$, we may decide to take the
prediction into account. In this case, if the prediction is a false positive event (no actual fault strikes),
which happens with probability $1 - p$, then the algorithm spends $I$ units of time in the proactive mode.
Otherwise, if the prediction is a true positive event (an actual fault hits the system), which happens
with probability $p$, then the algorithm spends an average of $E_I^{(f)}$ in the proactive mode. Here $E_I^{(f)}$ is the
expectation of the time elapsed between the beginning of the prediction window and the time when a
fault happens, knowing that a fault happens in the prediction window. Note that if faults are uniformly
distributed across the prediction window, then $E_I^{(f)} = \frac{I}{2}$. Altogether, we obtain:

$$I' = (1 - q) \cdot 0 + q \left( (1 - p) \cdot I + p \cdot E_I^{(f)} \right) = q \left( (1 - p) I + p E_I^{(f)} \right).$$

Finally, each time there is a prediction, that is, on average, every $\mu_P$ seconds, the algorithm spends
a time $I'$ in the proactive mode. Therefore, Algorithm 1 spends a fraction of time $\frac{I'}{\mu_P}$ in the proactive
mode, and a fraction of time $1 - \frac{I'}{\mu_P}$ in the regular mode. We now identify the four different sources of
waste, and we analyze their respective costs.
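
The computation of $I'$ and of the fraction of time spent in each mode can be sketched as follows in Python. The function names and sample values are hypothetical; when no value is supplied for $E_I^{(f)}$, the sketch assumes that faults are uniformly distributed across the prediction window, so that $E_I^{(f)} = I/2$.

```python
def proactive_time(I, p, q, E_f=None):
    """Average time I' spent in proactive mode per prediction."""
    if E_f is None:
        E_f = I / 2  # assumption: faults uniformly distributed in the window
    return q * ((1 - p) * I + p * E_f)


def mode_fractions(I, p, q, mu_P, E_f=None):
    """Return (fraction in proactive mode, fraction in regular mode)."""
    proactive = proactive_time(I, p, q, E_f) / mu_P
    return proactive, 1 - proactive


# Hypothetical values: 300 s window, precision 0.8, predictions always trusted,
# one prediction every 10,000 s on average.
I_prime = proactive_time(I=300.0, p=0.8, q=1.0)
frac_proactive, frac_regular = mode_fractions(I=300.0, p=0.8, q=1.0, mu_P=10_000.0)
print(I_prime, frac_proactive, frac_regular)
```

With these values the algorithm spends about 180 s in proactive mode per prediction, that is, about 1.8% of the time.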

1. Waste due to periodic checkpointing. There are two cases, depending upon the mode of
Algorithm 1.

(a) Regular mode. In this mode, we take periodic checkpoints. We take a checkpoint of size
$C$ each time the algorithm has processed work for a time $T_{NP} - C$ in the regular mode. This
remains true if, after spending some time in the regular mode, the algorithm switches to the
proactive mode, and later switches back to the regular mode. This behavior is enforced by
recording the amount of work performed under the regular mode (variable $W_{reg}$, at line 12 of
Algorithm 1), and by taking this value into account at line 15.

Taking into account the fraction of time that Algorithm 1 spends in the regular mode, this
source of waste has a total cost of:

$$\left(1 - \frac{I'}{\mu_P}\right) \frac{C}{T_{NP}} \tag{5a}$$

(b) Proactive mode. In this mode, we take a checkpoint of size $C$ each time the algorithm has
processed work for a time $T_P - C$.

If no fault happens while the algorithm is in the proactive mode, then the algorithm stays
exactly a time $I$ in this mode (thanks to the condition at line 4). The waste due to the
periodic checkpointing is exactly $\frac{C}{T_P}$ (because $T_P$ divides $I$).

If a fault happens while the algorithm is in proactive mode, then the expectation of the waste
due to the periodic checkpointing is upper-bounded by the same quantity $\frac{C}{T_P}$. This is an
over-approximation of the waste in that case, because the fault may strike before full completion
of the last period.

Overall, taking into account the fraction of time Algorithm 1 is in the proactive mode, the
cost of this source of waste is:

$$\frac{I'}{\mu_P} \frac{C}{T_P} \tag{5b}$$

2. Waste incurred when switching to the proactive mode. Each time we take into account
a prediction (which happens with probability $q$ on average every $\mu_P$ units of time), we start by
doing one preliminary checkpoint if we have the time to do so (line 9). If we do not have the time
to take an additional checkpoint, the algorithm does not do any processing for a duration of at most
$C$ (line 11). In both cases, the wasted time is at most $C$ and this happens once every $\frac{\mu_P}{q}$ seconds.
Hence, switching from the regular mode to the proactive one induces a waste of at most

$$\frac{q}{\mu_P} C \tag{5c}$$
3. Waste due to predicted faults. Predicted faults happen with frequency $\frac{p}{\mu_P}$. As we may choose
to ignore a prediction, there are still two cases depending on the mode of the algorithm at the time
the fault hits the system.

(a) Regular mode. If the algorithm is in regular mode when a predicted fault hits, this means
that we have chosen to ignore the prediction, a decision taken with probability $(1 - q)$.
The time wasted because of the predicted fault is then the time elapsed between the last
checkpoint and the fault, plus the downtime and the time needed for the recovery. The
expectation of the time elapsed between the last checkpoint and the fault is equal to half the
period of checkpoints, because the time where the fault hits the system is independent of the
checkpointing algorithm. Therefore, the waste due to predicted faults hitting the system in
regular mode is:

$$\frac{p(1 - q)}{\mu_P} \left( \frac{T_{NP}}{2} + D + R \right) \tag{5d}$$

(b) Proactive mode. If the algorithm is in proactive mode when a fault hits, then we have
chosen to take the prediction into account, a decision that is taken with probability $q$.
The time wasted because of the predicted fault is then, in addition to the downtime and the
time needed for the recovery, the time elapsed between the last checkpoint and the fault or,
if no checkpoint had already been taken in the proactive mode, the time elapsed between the
start of the proactive mode and the fault.

Here, we can no longer assume that the time the fault hits the system is independent of the
checkpointing date. This is because the proactive mode starts exactly at the beginning of
the prediction window. Let $T_{lost}$ denote the computation time elapsed between the latest of
the beginning of the proactive mode and the last checkpoint, and the fault date. Then the
expectation of $T_{lost}$ depends on the distribution of the fault date in the prediction window.
However, we know that whatever the distribution, $T_{lost} \le T_P$. Therefore we over-approximate
the waste in that case by:

$$\frac{pq}{\mu_P} \left( T_P + D + R \right) \tag{5e}$$

4. Waste due to unpredicted faults. There are again two cases, depending upon the mode of the
algorithm at the time the fault hits the system.

(a) Regular mode. In this mode the work done is periodically checkpointed with period $T_{NP}$.
The time wasted because of an unpredicted fault is then the time elapsed between the last
checkpoint and the fault, plus the downtime and the time needed for the recovery. As before,
the expectation of the time elapsed since the last checkpoint is $T_{lost} = \frac{T_{NP}}{2}$.

An unpredicted fault hits the system once every $\mu_{NP}$ seconds on average. Taking into
account the fraction of the time the algorithm is in regular mode, the waste due to unpredicted
faults hitting the system in regular mode is:

$$\left(1 - \frac{I'}{\mu_P}\right) \frac{1}{\mu_{NP}} \left( \frac{T_{NP}}{2} + D + R \right) \tag{5f}$$

(b) Proactive mode. Because of the assumption that a single event takes place within a time
interval, we do not consider the very unlikely case where an unpredicted fault strikes during a
prediction window. This amounts to assuming that $\frac{I'}{\mu_P} \frac{1}{\mu_{NP}} (T_P + D + R)$ is negligible.

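Finally, we gather the expressions of the different types of waste of Equations (5a) through (5f) to
obtain the formula of the overall waste:

$$\mathrm{Waste}_{withCkpt} = \left(1 - \frac{I'}{\mu_P}\right) \frac{C}{T_{NP}} + \frac{I'}{\mu_P} \frac{C}{T_P} + \frac{q}{\mu_P} C + \frac{p(1 - q)}{\mu_P} \left( \frac{T_{NP}}{2} + D + R \right) + \frac{pq}{\mu_P} \left( T_P + D + R \right) + \left(1 - \frac{I'}{\mu_P}\right) \frac{1}{\mu_{NP}} \left( \frac{T_{NP}}{2} + D + R \right)$$

$$\mathrm{Waste}_{withCkpt} = \left(1 - \frac{I'}{\mu_P}\right) \frac{C}{T_{NP}} + \frac{I'}{\mu_P} \frac{C}{T_P} + \frac{q}{\mu_P} C + \frac{pq}{\mu_P} T_P + \frac{p(1 - q)}{\mu_P} \frac{T_{NP}}{2} + \left(1 - \frac{I'}{\mu_P}\right) \frac{1}{\mu_{NP}} \frac{T_{NP}}{2} + \left( \frac{p}{\mu_P} + \left(1 - \frac{I'}{\mu_P}\right) \frac{1}{\mu_{NP}} \right) (D + R) \tag{6}$$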
4.2 Computing the waste of the other strategies

The waste of the first strategy (Instantaneous) is very close to the one given in Equation (2). The
difference lies in $T_{lost}$, the expectation of the work lost when a fault is predicted and the prediction is
taken into account. When a prediction is taken into account and the predicted event is an actual fault,
the waste in Equation (2) was $\frac{pq}{\mu_P}(C + D + R)$ (see Equation (1c)). Because the prediction was exact,
$T_{lost}$ was equal to 0. However in our new equation, the waste for this part is now $\frac{pq}{\mu_P}(C + T_{lost} + D + R)$.
On average, the fault occurs after a time $E_I^{(f)}$. However, because we do not know the relation between
$E_I^{(f)}$ and $T_{NP}$, we take $\min\left(E_I^{(f)}, \frac{T_{NP}}{2}\right)$ as the expectation of $T_{lost}$ (that is, $\frac{T_{NP}}{2}$ when
$\frac{T_{NP}}{2} < E_I^{(f)}$). The new waste is then:

$$\mathrm{Waste}_{instant} = \frac{C}{T_{NP}} + \frac{1}{\mu} \left[ (1 - rq)\frac{T_{NP}}{2} + D + R + \frac{qr}{p} C + qr \min\left( E_I^{(f)}, \frac{T_{NP}}{2} \right) \right] \tag{7}$$



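As for the second strategy (No checkpoint during prediction window), we no longer incur the 
waste of Equation (5b), as we no longer checkpoint in proactive mode. Furthermore, the value of $T_{lost}$ 
in Equation (5e) becomes $E_I^{(f)}$ instead of $T_P$. Consequently, the total waste when there is no checkpoint 
during the proactive mode is: 

$$\begin{aligned}
\mathrm{WASTE}_{\mathrm{noCkpt}} ={}& \left(1 - \frac{I'}{\mu_P}\right)\frac{C}{T_{NP}} + \frac{q}{\mu_P}C + \frac{p(1-q)}{\mu_P}\left(\frac{T_{NP}}{2} + D + R\right) + \frac{pq}{\mu_P}\left(E_I^{(f)} + D + R\right) \\
& + \left(1 - \frac{I'}{\mu_P}\right)\frac{1}{\mu_{NP}}\left(\frac{T_{NP}}{2} + D + R\right)
\end{aligned}$$

which we rewrite as 

$$\begin{aligned}
\mathrm{WASTE}_{\mathrm{noCkpt}} ={}& \left(\left(1 - \frac{I'}{\mu_P}\right)\frac{C}{T_{NP}} + \frac{q}{\mu_P}C\right) + \frac{p(1-q)}{\mu_P}\frac{T_{NP}}{2} + \frac{pq}{\mu_P}E_I^{(f)} + \left(1 - \frac{I'}{\mu_P}\right)\frac{1}{\mu_{NP}}\frac{T_{NP}}{2} \\
& + \left(\frac{p}{\mu_P} + \left(1 - \frac{I'}{\mu_P}\right)\frac{1}{\mu_{NP}}\right)(D + R)
\end{aligned} \tag{8}$$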

Note that when $I = 0$, the first and second strategies collapse. Indeed, we have $E_I^{(f)} = 0$ if $I = 0$, 
and we check that Equations (7) and (8) are identical in that case. 
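
The collapse of the two strategies at $I = 0$ can be checked numerically. The following sketch is an illustration only: it assumes that the two wastes take the forms coded below (one reading of Equations (7) and (8)), together with the relations $1/\mu_P = r/(p\mu)$ and $1/\mu_{NP} = (1-r)/\mu$; all names are hypothetical.

```python
def waste_instant(T_np, q, C, D, R, mu, r, p, E_f):
    """Assumed form of the waste of the Instantaneous strategy."""
    lost = min(E_f, T_np / 2.0)  # expected work lost after a trusted, correct prediction
    return C / T_np + (
        (1.0 - r * q) * T_np / 2.0 + D + R + q * r / p * C + q * r * lost
    ) / mu


def waste_no_ckpt(T_np, q, C, D, R, mu, r, p, I, E_f):
    """Assumed form of the waste when no checkpoint is taken in proactive mode."""
    inv_mu_p = r / (p * mu)        # assumed: 1/mu_P
    inv_mu_np = (1.0 - r) / mu     # assumed: 1/mu_NP
    I_prime = q * ((1.0 - p) * I + p * E_f)
    frac_regular = 1.0 - I_prime * inv_mu_p
    return (
        frac_regular * C / T_np
        + q * inv_mu_p * C
        + p * (1.0 - q) * inv_mu_p * (T_np / 2.0 + D + R)
        + p * q * inv_mu_p * (E_f + D + R)
        + frac_regular * inv_mu_np * (T_np / 2.0 + D + R)
    )


if __name__ == "__main__":
    # With I = 0 (hence E_f = 0) the two values should agree for any q.
    args = dict(C=600.0, D=60.0, R=600.0, mu=86400.0, r=0.7, p=0.4)
    for q in (0.0, 0.5, 1.0):
        a = waste_instant(10000.0, q, E_f=0.0, **args)
        b = waste_no_ckpt(10000.0, q, I=0.0, E_f=0.0, **args)
        print(q, a, b)
```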

4.3 Validity of the results 

In this subsection, we discuss the validity of the model. The analysis is similar to that of Section 3.2, 
except that we deal with intervals of different lengths here. As before, we assume that there is a single event 
of any type within each interval under study. The condition $T \le \alpha\mu_e$ then becomes 

$$T_{NP} + I \le \alpha\mu_e \tag{9}$$

as $T_{NP} + I$ is the longest time interval considered in the analysis of Algorithm 1. 

4.4 Waste minimization 

In this section we aim at minimizing the waste of the three strategies, and then we find conditions to 
characterize which one is better. Recall that $I' = q\left((1-p)I + pE_I^{(f)}\right)$. 



4.4.1 With checkpoints during prediction window (Algorithm 1) 

In order to compute the optimal value for $T_P$, let us find the portion of the waste that depends on $T_P$: 

$$\mathrm{WASTE}_{T_P} = \frac{rq}{\mu}\left(\frac{(1-p)I + pE_I^{(f)}}{p}\frac{C}{T_P} + T_P\right)$$
As we can see, the optimal value for $T_P$ is independent of $q$, but also of $T_{NP}$. The optimal value for 
$T_P$ is thus: 

$$T_P^{extr} = \sqrt{\frac{(1-p)I + pE_I^{(f)}}{p}C} \tag{10}$$

However, for our algorithm to be correct, we want $\frac{I}{T_P} \in \mathbb{N}$ (the interval $I$ is partitioned in $k$ intervals of 
length $T_P$, for some integer $k$). We choose $T_P^{opt}$ equal to either $I / \lfloor I/T_P^{extr} \rfloor$ or $I / \lceil I/T_P^{extr} \rceil$, depending on the 
value that minimizes $\mathrm{WASTE}_{T_P}$. Note that we also have the constraint $T_P^{opt} \ge C$, hence if both values 
are lower than $C$, then $T_P^{opt} = C$. 
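
The selection rule above can be sketched as follows. This is only an illustration under stated assumptions: the window length is positive, a candidate shorter than C is discarded when the other one is feasible, and a zero floor (when the unconstrained minimizer exceeds the window) is replaced by a single interval. The function names are hypothetical.

```python
import math


def proactive_period(C, p, I, E_f):
    """Pick the proactive checkpointing period among the two integer partitions of I.

    E_f stands for E_I^(f). Assumes I > 0 and 0 < p <= 1.
    """
    x = ((1.0 - p) * I + p * E_f) / p
    T_extr = math.sqrt(x * C)  # unconstrained minimizer of x*C/T + T

    def waste_tp(T):
        # Part of the waste that depends on the period, up to the factor r*q/mu.
        return x * C / T + T

    # Number of intervals partitioning the window; at least one (assumption).
    k_floor = max(1, math.floor(I / T_extr))
    k_ceil = max(1, math.ceil(I / T_extr))
    candidates = [I / k_floor, I / k_ceil]

    feasible = [T for T in candidates if T >= C]
    if not feasible:
        return C  # both candidates are shorter than a checkpoint
    return min(feasible, key=waste_tp)
```

For instance, with C = 60, p = 0.5, I = 3000 and an expected fault date of 1500, the unconstrained minimizer is about 519.6, the two candidates are 600 and 500, and the sketch selects 500.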

Now that we know that $T_P^{opt}$ is independent of both $q$ and $T_{NP}$, we can see the waste in Equation (6) 
as a function of two variables. One can see from Equation (6) that the waste is an affine function of $q$. 
This means that the minimum is always reached for either $q = 0$ or $q = 1$. We now consider the two 
functions $\mathrm{WASTE}_{\mathrm{withCkpt}\{q=0\}}$ and $\mathrm{WASTE}_{\mathrm{withCkpt}\{q=1\}}$ in order to minimize them with respect to $T_{NP}$. 
First we have: 

$$\mathrm{WASTE}_{\mathrm{withCkpt}\{q=0\}} = \frac{C}{T_{NP}} + \frac{1}{\mu}\left(\frac{T_{NP}}{2} + D + R\right) \tag{11}$$

As expected, this is exactly the equation without prediction. The study of the optimal solution has been 
done in Section 3: it is minimized when $T_{NP}^{opt} = \min\left(\alpha\mu_e - I, \max\left(\sqrt{2C\mu}, C\right)\right)$. 
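Next we have: 

$$\begin{aligned}
\mathrm{WASTE}_{\mathrm{withCkpt}\{q=1\}} ={}& \left(1 - \frac{r}{p\mu}\left((1-p)I + pE_I^{(f)}\right)\right)\left(\frac{C}{T_{NP}} + \frac{1-r}{\mu}\frac{T_{NP}}{2}\right) + \frac{r}{\mu}\left(\frac{(1-p)I + pE_I^{(f)}}{p}\frac{C}{T_P^{opt}} + T_P^{opt}\right) + \frac{r}{p\mu}C \\
& + \left(\frac{r}{\mu} + \left(1 - \frac{r}{p\mu}\left((1-p)I + pE_I^{(f)}\right)\right)\frac{1-r}{\mu}\right)(D + R)
\end{aligned} \tag{12}$$

This equation is minimized when 

$$T_{NP}^{extr} = \sqrt{\frac{2\mu C}{1-r}}$$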



One can remark that this value is equal to the result without intervals (Section 3). Actually, the only 
impact of the prediction interval $I$ is the moment when we should take a pre-emptive action. Note that 
when $r = 0$ (this means that there is no prediction), we have $T_{NP}^{extr} = \sqrt{2\mu C}$, and we retrieve Young's 
formula [4]. 

Finally, we know that the waste is defined for $C \le T_{NP} \le \alpha\mu_e - I$. Hence, if $T_{NP}^{extr} \notin [C, \alpha\mu_e - I]$, this 
solution is not admissible. However, Equation (12) is convex, so the optimal solution is $C$ if $T_{NP}^{extr} < C$, 
and $\alpha\mu_e - I$ if $T_{NP}^{extr} > \alpha\mu_e - I$. Hence, when $q = 1$, the optimal solution should be 

$$T_{NP}^{opt} = \min\left(\alpha\mu_e - I, \max\left(\sqrt{\frac{2\mu C}{1-r}}, C\right)\right). \tag{13}$$
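
The clamping of the regular-mode period to its admissible interval can be written in a few lines. The sketch below is illustrative and assumes $r < 1$; the function name and argument order are hypothetical, and `alpha` and `mu_e` denote the validity parameters of condition (9).

```python
import math


def regular_period(C, mu, r, alpha, mu_e, I):
    """Regular-mode period when predictions are trusted (q = 1).

    The unconstrained minimizer sqrt(2*mu*C/(1-r)) is clamped to the
    interval [C, alpha*mu_e - I]. Assumes r < 1.
    """
    T_extr = math.sqrt(2.0 * mu * C / (1.0 - r))
    return min(alpha * mu_e - I, max(T_extr, C))
```

Setting r = 0 gives the period used when predictions are ignored: inside the admissible interval, the result is Young's value $\sqrt{2\mu C}$.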



4.4.2 Waste for the algorithm that does not checkpoint during the proactive mode 

One can see that Equation (6) and Equation (8) only differ by the quantity 

$$\frac{rq}{\mu}\left(\frac{(1-p)I + pE_I^{(f)}}{p}\frac{C}{T_P^{opt}} + T_P^{opt} - E_I^{(f)}\right).$$

This value is linear in $q$ and constant with regard to $T_{NP}$. Hence the minimization is almost the same. 
Once again we can see that the optimal value for $q$ is either 0 or 1. We can consider the two functions 
$\mathrm{WASTE}_{\mathrm{noCkpt}\{q=0\}}$ and $\mathrm{WASTE}_{\mathrm{noCkpt}\{q=1\}}$. We remark that $\mathrm{WASTE}_{\mathrm{noCkpt}\{q=0\}} = \mathrm{WASTE}_{\mathrm{withCkpt}\{q=0\}}$, 
and hence that the study has already been done. As for $\mathrm{WASTE}_{\mathrm{noCkpt}\{q=1\}}$, it is also minimized when 

$$T_{NP}^{extr} = \sqrt{\frac{2\mu C}{1-r}}.$$

Finally, the last step of this study is identical to the previous minimization, and the optimal solution 
when $q = 1$ is defined by $T_{NP}^{opt} = \min\left(\alpha\mu_e - I, \max\left(\sqrt{\frac{2\mu C}{1-r}}, C\right)\right)$. 



4.4.3 Identifying the most efficient algorithm 



Finally in this section, we consider the waste for the two algorithms that take the prediction window into 
account (the one that does not checkpoint during the prediction window, and the one that checkpoints 
during the prediction window), and try to find conditions of dominance of one strategy over the other. 
Since the equation of the waste is identical when q = 0, let us consider the case when q = 1. We have 
seen that: 



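$$\mathrm{WASTE}_{\mathrm{withCkpt}\{q=1\}} - \mathrm{WASTE}_{\mathrm{noCkpt}\{q=1\}} = \frac{r}{\mu}\left(\frac{(1-p)I + pE_I^{(f)}}{p}\frac{C}{T_P^{opt}} + T_P^{opt} - E_I^{(f)}\right) \tag{14}$$

We want to know when Equation (14) is nonnegative (meaning that it is beneficial not to take any 
checkpoints during proactive mode). We know that this value is minimized when 
$T_P^{extr} = \sqrt{\frac{(1-p)I + pE_I^{(f)}}{p}C}$ (Equation (10)). A sufficient condition is then obtained by studying the inequality 
$\mathrm{WASTE}_{\mathrm{withCkpt}\{q=1\}} - \mathrm{WASTE}_{\mathrm{noCkpt}\{q=1\}} \ge 0$ with $T_P^{extr}$ instead of $T_P^{opt}$. That is: 

$$\frac{r}{\mu}\left(\frac{(1-p)I + pE_I^{(f)}}{p}\frac{C}{T_P^{extr}} + T_P^{extr} - E_I^{(f)}\right) \ge 0 \iff 2\sqrt{\frac{(1-p)I + pE_I^{(f)}}{p}C} \ge E_I^{(f)} \tag{15}$$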

Consequently, we can say that if Equation (15) holds, then $\mathrm{WASTE}_{\mathrm{noCkpt}} \le \mathrm{WASTE}_{\mathrm{withCkpt}}$: the 
algorithm where we do not checkpoint during the proactive mode has a better solution than Algorithm 1. 
For example, if we assume that faults strike uniformly during the prediction window $[t_0, t_0 + I]$, in other 
words, if for $0 \le x \le I$ the probability that the fault occurs in the interval $[t_0, t_0 + x]$ is $\frac{x}{I}$, then $E_I^{(f)} = \frac{I}{2}$, 
and our condition becomes 

$$I \le 16\,\frac{1 - \frac{p}{2}}{p}\,C.$$
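
As a hedged illustration, the sufficient condition and its uniform special case can be checked numerically. The sketch below assumes the condition reads $2\sqrt{\frac{(1-p)I + pE_I^{(f)}}{p}C} \ge E_I^{(f)}$; the function names are hypothetical.

```python
import math


def no_ckpt_dominates(C, p, I, E_f):
    """Sufficient condition under which skipping checkpoints in proactive mode wins.

    E_f stands for E_I^(f), the expected fault date inside the window.
    """
    x = ((1.0 - p) * I + p * E_f) / p
    return 2.0 * math.sqrt(x * C) >= E_f


def max_window_uniform(C, p):
    """Largest window length satisfying the condition when faults are uniform (E_f = I/2)."""
    return 16.0 * (1.0 - p / 2.0) / p * C
```

With C = 60 and p = 0.5, the bound on the window length is 1440: a uniform window of length 1000 satisfies the condition, whereas a window of length 2000 does not.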

We can now finish our study by saying that in order to find the optimal solution, one should compute 
both optimal solutions for $q = 0$ and $q = 1$, for both algorithms, and choose the one that minimizes 
the waste, as was done in Section 3, except when Equation (15) holds, in which case we can focus on the 
computation of the waste of the algorithm that does not checkpoint during proactive mode. 



5 Related work 

Considerable research has been conducted on fault prediction using different models (system logs analysis, 
event-driven approach). In this section we give a brief overview of the results obtained by predictors. 
We focus on their results rather than on their methods of prediction. 

The authors of [5] introduce the lead time, that is, the time between the prediction and the actual 
fault. This time should be sufficient to take proactive actions. They are also able to give the 
location of the fault. The fact that the location is given has an impact on the precision and the recall 
(see Table 1). The authors of [S] also consider a lead time, and introduce a prediction window 
during which the predicted fault should happen. This motivates the work of Section 4, even though they never 
give the size of their prediction window. Unfortunately, much of the work done on prediction does not 
provide information that could be really useful for the design of good algorithms. This information 
includes the items stated above, namely the lead time and the size of the prediction window. Other information 
that could be useful includes the distribution of the faults in the prediction window, the precision as a 
function of the recall (see our analysis), or even the precision and recall as functions of the prediction 
window (what happens with a bigger prediction window). Again, all this information could be useful 
in the design of good algorithms. 

Paper    Lead Time    Precision    Recall    Prediction Window 
[?]      300 s        40 %         70 %      
[?]      600 s        35 %         60 %      
[?]      2 h          64.8 %       65.2 %    yes (size unknown) 
[?]      min          82.3 %       85.4 %    yes (size unknown) 
[?]      32 s         93 %         43 %      
[?]      NC           70 %         75 %      

Table 1: Comparative study of different parameters returned by some predictors. 

While many studies on fault prediction focus on the design of the predictor, most of them consider 
that the proactive action should simply be a checkpoint or a migration right before the fault. However, 
in their paper [2], Li et al. consider the mathematical problem of determining when and how to migrate. 
In order to be able to use migration, they stated that, at any time, 2% of the resources are available. 
This allowed them to design a Knapsack-based heuristic. Thanks to their algorithm, they were able to 
save 30% of the execution time compared to a heuristic that does not take reliability into account, 
with a precision and recall of 70%, and with a maximum load of 0.7. 

6 Conclusion 

The comprehensive analytical results provided in this preliminary report make it possible to fully assess the impact 
of fault prediction on optimal checkpointing strategies. Future work will be devoted to instantiating these 
results with realistic application/platform scenarios, and to providing an experimental evaluation of the 
importance of fault prediction to reduce checkpoint overhead. 